Trigger one PagerDuty alert after multiple failures from New Relic

I’m trying to configure PagerDuty as a sort of gatekeeper to filter out low-quality violations from New Relic ping tests. My goal is to trigger a PagerDuty alert through the escalation policy chain only if the server is down for more than 10 minutes. Right now we’re getting alerts in the middle of the night when the server is down for only a few seconds.

I have my New Relic Alert Policy incident preferences set to open an incident every time a condition is violated, and I’m running tests every 5 minutes. The Synthetics monitor is set to search for an element that does not exist on the page, purely to force a violation for testing. I’ve configured an Event Rule so that when the body contains “Ping” the following actions are performed:

* Route to [Backend Critical (New Relic)](https://24datainc.pagerduty.com/services/XXXX)
* Suppress until more than `2 alerts` are received within `12 minutes`
* Then stop processing
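
For what it’s worth, here is one reading of that suppression rule: matching events are counted in a rolling window, and an incident is only opened once the count goes over the threshold. Below is a minimal Python sketch of that reading; the 12-minute window, the “more than 2” threshold, and the 5-minute ping interval come from this thread, but the code itself is only an illustration, not PagerDuty’s actual implementation.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=12)   # "within 12 minutes"
THRESHOLD = 2                    # suppress until "more than 2 alerts"

def should_trigger(event_times, now):
    """True once more than THRESHOLD matching events fall inside
    the rolling WINDOW that ends at `now`."""
    recent = [t for t in event_times if now - t <= WINDOW]
    return len(recent) > THRESHOLD

# Ping failures arriving every 5 minutes, matching the test interval above.
start = datetime(2019, 1, 1, 10, 45)
failures = [start + timedelta(minutes=5 * i) for i in range(3)]

seen = []
for t in failures:
    seen.append(t)
    verdict = "trigger incident" if should_trigger(seen, t) else "suppress"
    print(t.strftime("%H:%M"), verdict)

# Prints: 10:45 suppress, 10:50 suppress, 10:55 trigger incident.
# The third consecutive failure arrives 10 minutes after the first,
# which is when "more than 2 alerts within 12 minutes" is first exceeded.
```

Under that reading the rule lines up with the stated goal: with checks every 5 minutes, nothing should reach the escalation policy until the server has been failing for roughly 10 minutes.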

I am artificially forcing 2 violations in a row in New Relic, hoping that PagerDuty will filter out the first warning and trigger an incident to the escalation policy on the second, but it seems like:
a. New Relic is only sending PagerDuty one notification, and
b. PagerDuty is receiving that first notification from NR and immediately sending it through the escalation policy.

Any advice please?

It sounds like you’re on the right track. If PagerDuty is triggering an incident even though your rule is set to trigger one only when 2 events are received within 12 minutes, the events might not be hitting that rule at all. Are you sending events to the key/endpoint listed in Configuration > Event Rules > Incoming Data Source? If so, could you point me to an incident that was supposed to hit a rule but didn’t?
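
One way to check the key/endpoint question from the sending side, without waiting on New Relic, is to push a hand-rolled test event at that key. A rough sketch, assuming the Event Rules integration key accepts a standard Events API v2 payload at the enqueue endpoint; the `ROUTING_KEY`, source name, and summary text below are placeholders, not values from this thread.

```python
import requests

# Placeholder: the integration key shown under
# Configuration > Event Rules > Incoming Data Source.
ROUTING_KEY = "YOUR_GLOBAL_EVENT_RULES_KEY"

def send_test_ping_failure(summary):
    """Send one trigger event whose summary contains "Ping",
    so it should match the event rule described earlier."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,            # deliberately contains "Ping"
            "source": "synthetic-test",    # placeholder source name
            "severity": "critical",
        },
    }
    resp = requests.post("https://events.pagerduty.com/v2/enqueue",
                         json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()  # includes status, message and a dedup_key

print(send_test_ping_failure("Ping check failed (manual test event)"))
```

Firing a couple of these a few minutes apart should show whether events are reaching the rule and being suppressed as configured, independently of what New Relic actually sends.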


Thanks for the reply, Tom, and thanks for pointing me in the right direction. I was using the wrong integration, but now that I’m using the correct one, all events are suppressed. There’s something I’m not understanding here.

The most current alert at 10:58 today is a good example.

[screenshot of the suppressed alert]

I’ve had three failures so far but still no alerts from PagerDuty.

[screenshot of the three failures]

Since you’re just now setting things up, is there a chance you’ve got a schedule set up with gaps in it? If so, you may want to directly add yourself to the escalation policy while testing to ensure an incident is triggered, as somebody must be on-call for a service in order for an incident to be created.

There are some other common issues you can check for here:

/forum/t/no-incident-was-triggered-in-pagerduty-what-might-have-happened/1621
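
For the on-call point specifically, one quick way to double-check it outside the UI is the REST API’s `/oncalls` endpoint, which lists who is currently on call for an escalation policy. A small sketch, assuming a read-only REST API key; the `API_KEY` and `ESCALATION_POLICY_ID` values are placeholders.

```python
import requests

API_KEY = "YOUR_REST_API_KEY"        # placeholder, read-only is enough
ESCALATION_POLICY_ID = "PXXXXXX"     # placeholder escalation policy ID

resp = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers={
        "Authorization": f"Token token={API_KEY}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
    params={"escalation_policy_ids[]": ESCALATION_POLICY_ID},
    timeout=10,
)
resp.raise_for_status()

oncalls = resp.json().get("oncalls", [])
if not oncalls:
    print("Nobody is on call for this escalation policy.")
for entry in oncalls:
    user = entry.get("user") or {}
    print(f"level {entry.get('escalation_level')}: {user.get('summary')} "
          f"until {entry.get('end') or 'no end (permanently on call)'}")
```

If that list comes back empty, no incident will be created for services using that policy, which matches the advice above about making sure somebody is on call while testing.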

Thanks for the advice, Jonathan. I’m not using a schedule; instead, I have myself directly added to the escalation policy at every stage so that I don’t bother my responders with alerts while I’m getting set up.

Hi Greg,

We’ll have to do some inbound event troubleshooting to investigate this further for you. We don’t do that kind of troubleshooting in the Community forums, so I’d recommend reaching out to Support over email; feel free to reference this Community thread.

The Support team will let you know any further information they need to investigate.

Hi @gregu,

Looking at the suppression message, it will always match your SFL Form policy, as you have a suppression rule which matches “Form” (among other things).

Cheers,
@simonfiddaman

Thank you @simonfiddaman. I changed my settings last week after I found out that I could achieve what I wanted through New Relic. Originally, though, there was an option in the last line of the Event Rules to suppress incident creation unless PagerDuty received X alerts within Y minutes, at which point it would create an incident. That option seems to have been removed.

Removing that option from the catch-all rule was a recent design decision; the idea is that, by definition, a “catch-all” rule shouldn’t have criteria in it.

If this type of filtering is required, you can get the same functionality by adding a similar filter as your last rule (before the catch-all rule).
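
To illustrate the ordering point with a toy model (this is only a conceptual sketch of “first matching rule wins, with the catch-all last”, not PagerDuty’s actual rule engine, and the rule names and shapes below are made up):

```python
# Toy model: rules evaluated top to bottom, first match wins,
# with the catch-all sitting at the very end.
rules = [
    # ...more specific rules would sit above here...
    {
        "name": "Ping threshold filter (last rule before the catch-all)",
        "matches": lambda event: "Ping" in event.get("summary", ""),
        "action": "route to Backend Critical, suppress unless >2 in 12 min",
    },
    {
        "name": "catch-all",
        "matches": lambda event: True,   # anything no other rule claimed
        "action": "default routing (no criteria allowed here any more)",
    },
]

def first_matching_rule(event):
    for rule in rules:
        if rule["matches"](event):
            return rule["name"], rule["action"]

print(first_matching_rule({"summary": "Ping check failed"}))
print(first_matching_rule({"summary": "Disk space low"}))
```

In other words, a “Ping” event is claimed by the threshold filter before it can ever reach the catch-all, which is how the old behaviour is recovered.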
